add results for Granite Embedding English R2 models #256
KennethEnevoldsen merged 1 commit into embeddings-benchmark:main from
Conversation
@aashka-trivedi, great to have this PR - I will mark this for review after the model has been merged in.
Model Results Comparison

Reference models: google/gemini-embedding-001, intfloat/multilingual-e5-large

Results for ibm-granite/granite-embedding-english-r2
| task_name | google/gemini-embedding-001 | ibm-granite/granite-embedding-english-r2 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|
| AmazonCounterfactualClassification | 0.9269 | 0.6545 | nan | 0.9696 |
| AmazonPolarityClassification | nan | 0.6693 | 0.9326 | 0.9774 |
| AmazonReviewsClassification | nan | 0.3327 | nan | 0.6880 |
| AppsRetrieval | 0.9375 | 0.1396 | 0.3255 | 0.9375 |
| ArXivHierarchicalClusteringP2P | 0.6492 | 0.5906 | 0.5569 | 0.6869 |
| ArXivHierarchicalClusteringS2S | 0.6384 | 0.5736 | 0.5367 | 0.6548 |
| ArguAna | 0.8644 | 0.5921 | 0.5436 | 0.8979 |
| ArxivClusteringP2P | nan | 0.4864 | 0.4431 | 0.6092 |
| ArxivClusteringS2S | nan | 0.4459 | 0.3843 | 0.5520 |
| AskUbuntuDupQuestions | 0.6424 | 0.6648 | 0.5924 | 0.7020 |
| BIOSSES | 0.8897 | 0.8659 | 0.8457 | 0.9692 |
| Banking77Classification | 0.9427 | 0.8556 | 0.7492 | 0.9427 |
| BiorxivClusteringP2P | nan | 0.3889 | 0.355 | 0.5522 |
| BiorxivClusteringP2P.v2 | 0.5386 | 0.4184 | 0.372 | 0.5642 |
| BiorxivClusteringS2S | nan | 0.3718 | 0.333 | 0.5093 |
| COIRCodeSearchNetRetrieval | 0.8106 | 0.6465 | nan | 0.8951 |
| CQADupstackGamingRetrieval | 0.7068 | 0.6504 | 0.587 | 0.7861 |
| CQADupstackRetrieval | nan | 0.5 | 0.3967 | 0.6830 |
| CQADupstackUnixRetrieval | 0.5369 | 0.5285 | 0.3988 | 0.7198 |
| ClimateFEVER | nan | 0.3582 | 0.2573 | 0.5693 |
| ClimateFEVERHardNegatives | 0.3106 | 0.3597 | 0.26 | 0.4900 |
| CodeFeedbackMT | 0.5628 | 0.5254 | 0.4278 | 0.9370 |
| CodeFeedbackST | 0.8533 | 0.7718 | 0.7426 | 0.9067 |
| CodeSearchNetCCRetrieval | 0.8469 | 0.4767 | 0.7783 | 0.9635 |
| CodeTransOceanContest | 0.8953 | 0.7707 | 0.7403 | 0.9496 |
| CodeTransOceanDL | 0.3147 | 0.3503 | 0.3128 | 0.4419 |
| CosQA | 0.5024 | 0.3701 | 0.348 | 0.5218 |
| DBPedia | nan | 0.396 | 0.413 | 0.5350 |
| EmotionClassification | nan | 0.4131 | 0.4758 | 0.9387 |
| FEVER | nan | 0.8804 | 0.8281 | 0.9628 |
| FEVERHardNegatives | 0.8898 | 0.8892 | 0.8379 | 0.9453 |
| FiQA2018 | 0.6178 | 0.4632 | 0.4381 | 0.7991 |
| HotpotQA | nan | 0.6736 | 0.7122 | 0.8758 |
| HotpotQAHardNegatives | 0.8701 | 0.6708 | 0.7055 | 0.8701 |
| ImdbClassification | 0.9498 | 0.6191 | 0.8867 | 0.9737 |
| LEMBNarrativeQARetrieval | nan | 0.4785 | 0.2422 | 0.6070 |
| LEMBNeedleRetrieval | nan | 0.43 | 0.28 | 0.8800 |
| LEMBPasskeyRetrieval | 0.3850 | 0.8175 | 0.3825 | 1.0000 |
| LEMBQMSumRetrieval | nan | 0.4158 | 0.2426 | 0.5507 |
| LEMBSummScreenFDRetrieval | nan | 0.9365 | 0.7112 | 0.9782 |
| LEMBWikimQARetrieval | nan | 0.859 | 0.568 | 0.8890 |
| MSMARCO | nan | 0.3214 | 0.437 | 0.4812 |
| MTOPDomainClassification | 0.9927 | 0.9235 | 0.9097 | 0.9995 |
| MTOPIntentClassification | nan | 0.7104 | nan | 0.9551 |
| MassiveIntentClassification | 0.8846 | 0.7056 | 0.6804 | 0.9194 |
| MassiveScenarioClassification | 0.9208 | 0.7524 | 0.7178 | 0.9930 |
| MedrxivClusteringP2P | nan | 0.3303 | 0.317 | 0.5153 |
| MedrxivClusteringP2P.v2 | 0.4716 | 0.3615 | 0.3431 | 0.5179 |
| MedrxivClusteringS2S | nan | 0.3224 | 0.2976 | 0.4969 |
| MedrxivClusteringS2S.v2 | 0.4501 | 0.3574 | 0.3152 | 0.5106 |
| MindSmallReranking | 0.3295 | 0.3172 | 0.3024 | 0.3412 |
| MultiLongDocRetrieval | nan | 0.4156 | 0.3302 | 0.5099 |
| NFCorpus | nan | 0.3749 | 0.3398 | 0.5575 |
| NQ | nan | 0.5822 | 0.6403 | 0.8248 |
| QuoraRetrieval | nan | 0.8784 | 0.8926 | 0.9235 |
| RedditClustering | nan | 0.5322 | 0.4691 | 0.7716 |
| RedditClusteringP2P | nan | 0.5642 | 0.63 | 0.7527 |
| SCIDOCS | 0.2515 | 0.2495 | 0.1745 | 0.3453 |
| SICK-R | 0.8275 | 0.7134 | 0.8023 | 0.9465 |
| STS12 | 0.8155 | 0.6702 | 0.8002 | 0.9546 |
| STS13 | 0.8989 | 0.8409 | 0.8155 | 0.9776 |
| STS14 | 0.8541 | 0.7477 | 0.7772 | 0.9753 |
| STS15 | 0.9044 | 0.8537 | 0.8931 | 0.9811 |
| STS16 | nan | 0.7888 | 0.8579 | 0.9763 |
| STS17 | 0.9161 | 0.8621 | 0.8812 | 0.9586 |
| STS22 | nan | 0.685 | nan | 0.7310 |
| STS22.v2 | 0.6797 | 0.6847 | 0.6366 | 0.7497 |
| STSBenchmark | 0.8908 | 0.7917 | 0.8729 | 0.9370 |
| SciDocsRR | nan | 0.8816 | 0.8422 | 0.9114 |
| SciFact | nan | 0.758 | 0.702 | 0.8660 |
| SprintDuplicateQuestions | 0.9690 | 0.9463 | 0.9314 | 0.9787 |
| StackExchangeClustering | nan | 0.677 | 0.5837 | 0.8395 |
| StackExchangeClustering.v2 | 0.9207 | 0.5958 | 0.4643 | 0.9207 |
| StackExchangeClusteringP2P | nan | 0.3416 | 0.329 | 0.5157 |
| StackExchangeClusteringP2P.v2 | 0.5091 | 0.4014 | 0.3854 | 0.5509 |
| StackOverflowDupQuestions | nan | 0.5427 | 0.5014 | 0.6292 |
| StackOverflowQA | 0.9671 | 0.918 | 0.8889 | 0.9717 |
| SummEval | nan | 0.3152 | 0.2964 | 0.4052 |
| SummEvalSummarization.v2 | 0.3828 | 0.2931 | 0.3141 | 0.3893 |
| SyntheticText2SQL | 0.6996 | 0.4955 | 0.5307 | 0.7875 |
| TRECCOVID | 0.8631 | 0.7056 | 0.7115 | 0.9499 |
| Touche2020 | nan | 0.229 | 0.2313 | 0.3939 |
| Touche2020Retrieval.v3 | 0.5239 | 0.5343 | 0.4959 | 0.7465 |
| ToxicConversationsClassification | 0.8875 | 0.6208 | 0.6601 | 0.9759 |
| TweetSentimentExtractionClassification | 0.6988 | 0.5256 | 0.628 | 0.8823 |
| TwentyNewsgroupsClustering | nan | 0.479 | 0.394 | 0.8349 |
| TwentyNewsgroupsClustering.v2 | 0.5737 | 0.4777 | 0.3921 | 0.8758 |
| TwitterSemEval2015 | 0.7917 | 0.6006 | 0.7528 | 0.8946 |
| TwitterURLCorpus | 0.8705 | 0.8334 | 0.8583 | 0.9571 |
| Average | 0.7275 | 0.5821 | 0.5592 | 0.7726 |
Results for ibm-granite/granite-embedding-small-english-r2
| task_name | google/gemini-embedding-001 | ibm-granite/granite-embedding-small-english-r2 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|
| AmazonCounterfactualClassification | 0.9269 | 0.6178 | nan | 0.9696 |
| AmazonPolarityClassification | nan | 0.6169 | 0.9326 | 0.9774 |
| AmazonReviewsClassification | nan | 0.3215 | nan | 0.6880 |
| AppsRetrieval | 0.9375 | 0.1354 | 0.3255 | 0.9375 |
| ArXivHierarchicalClusteringP2P | 0.6492 | 0.571 | 0.5569 | 0.6869 |
| ArXivHierarchicalClusteringS2S | 0.6384 | 0.5804 | 0.5367 | 0.6548 |
| ArguAna | 0.8644 | 0.544 | 0.5436 | 0.8979 |
| ArxivClusteringP2P | nan | 0.4802 | 0.4431 | 0.6092 |
| ArxivClusteringS2S | nan | 0.4407 | 0.3843 | 0.5520 |
| AskUbuntuDupQuestions | 0.6424 | 0.6483 | 0.5924 | 0.7020 |
| BIOSSES | 0.8897 | 0.865 | 0.8457 | 0.9692 |
| Banking77Classification | 0.9427 | 0.8363 | 0.7492 | 0.9427 |
| BiorxivClusteringP2P | nan | 0.389 | 0.355 | 0.5522 |
| BiorxivClusteringP2P.v2 | 0.5386 | 0.4088 | 0.372 | 0.5642 |
| BiorxivClusteringS2S | nan | 0.3633 | 0.333 | 0.5093 |
| COIRCodeSearchNetRetrieval | 0.8106 | 0.6046 | nan | 0.8951 |
| CQADupstackGamingRetrieval | 0.7068 | 0.6244 | 0.587 | 0.7861 |
| CQADupstackRetrieval | nan | 0.4783 | 0.3967 | 0.6830 |
| CQADupstackUnixRetrieval | 0.5369 | 0.5113 | 0.3988 | 0.7198 |
| ClimateFEVER | nan | 0.3156 | 0.2573 | 0.5693 |
| ClimateFEVERHardNegatives | 0.3106 | 0.3169 | 0.26 | 0.4900 |
| CodeFeedbackMT | 0.5628 | 0.5219 | 0.4278 | 0.9370 |
| CodeFeedbackST | 0.8533 | 0.7685 | 0.7426 | 0.9067 |
| CodeSearchNetCCRetrieval | 0.8469 | 0.4842 | 0.7783 | 0.9635 |
| CodeTransOceanContest | 0.8953 | 0.7763 | 0.7403 | 0.9496 |
| CodeTransOceanDL | 0.3147 | 0.3363 | 0.3128 | 0.4419 |
| CosQA | 0.5024 | 0.3558 | 0.348 | 0.5218 |
| DBPedia | nan | 0.3785 | 0.413 | 0.5350 |
| EmotionClassification | nan | 0.3457 | 0.4758 | 0.9387 |
| FEVER | nan | 0.8648 | 0.8281 | 0.9628 |
| FEVERHardNegatives | 0.8898 | 0.8758 | 0.8379 | 0.9453 |
| FiQA2018 | 0.6178 | 0.4081 | 0.4381 | 0.7991 |
| HotpotQA | nan | 0.6565 | 0.7122 | 0.8758 |
| HotpotQAHardNegatives | 0.8701 | 0.6623 | 0.7055 | 0.8701 |
| ImdbClassification | 0.9498 | 0.6037 | 0.8867 | 0.9737 |
| LEMBNarrativeQARetrieval | nan | 0.4132 | 0.2422 | 0.6070 |
| LEMBNeedleRetrieval | nan | 0.55 | 0.28 | 0.8800 |
| LEMBPasskeyRetrieval | 0.3850 | 0.7975 | 0.3825 | 1.0000 |
| LEMBQMSumRetrieval | nan | 0.3648 | 0.2426 | 0.5507 |
| LEMBSummScreenFDRetrieval | nan | 0.8991 | 0.7112 | 0.9782 |
| LEMBWikimQARetrieval | nan | 0.7995 | 0.568 | 0.8890 |
| MSMARCO | nan | 0.3013 | 0.437 | 0.4812 |
| MTOPDomainClassification | 0.9927 | 0.9015 | 0.9097 | 0.9995 |
| MTOPIntentClassification | nan | 0.6688 | nan | 0.9551 |
| MassiveIntentClassification | 0.8846 | 0.6708 | 0.6804 | 0.9194 |
| MassiveScenarioClassification | 0.9208 | 0.7279 | 0.7178 | 0.9930 |
| MedrxivClusteringP2P | nan | 0.329 | 0.317 | 0.5153 |
| MedrxivClusteringP2P.v2 | 0.4716 | 0.3646 | 0.3431 | 0.5179 |
| MedrxivClusteringS2S | nan | 0.3261 | 0.2976 | 0.4969 |
| MedrxivClusteringS2S.v2 | 0.4501 | 0.36 | 0.3152 | 0.5106 |
| MindSmallReranking | 0.3295 | 0.3042 | 0.3024 | 0.3412 |
| MultiLongDocRetrieval | nan | 0.4007 | 0.3302 | 0.5099 |
| NFCorpus | nan | 0.3714 | 0.3398 | 0.5575 |
| NQ | nan | 0.5537 | 0.6403 | 0.8248 |
| QuoraRetrieval | nan | 0.8736 | 0.8926 | 0.9235 |
| RedditClustering | nan | 0.5024 | 0.4691 | 0.7716 |
| RedditClusteringP2P | nan | 0.5492 | 0.63 | 0.7527 |
| SCIDOCS | 0.2515 | 0.2406 | 0.1745 | 0.3453 |
| SICK-R | 0.8275 | 0.6886 | 0.8023 | 0.9465 |
| STS12 | 0.8155 | 0.6741 | 0.8002 | 0.9546 |
| STS13 | 0.8989 | 0.8057 | 0.8155 | 0.9776 |
| STS14 | 0.8541 | 0.7294 | 0.7772 | 0.9753 |
| STS15 | 0.9044 | 0.8386 | 0.8931 | 0.9811 |
| STS16 | nan | 0.7799 | 0.8579 | 0.9763 |
| STS17 | 0.9161 | 0.8518 | 0.8812 | 0.9586 |
| STS22 | nan | 0.6684 | nan | 0.7310 |
| STS22.v2 | 0.6797 | 0.6685 | 0.6366 | 0.7497 |
| STSBenchmark | 0.8908 | 0.771 | 0.8729 | 0.9370 |
| SciDocsRR | nan | 0.8754 | 0.8422 | 0.9114 |
| SciFact | nan | 0.7549 | 0.702 | 0.8660 |
| SprintDuplicateQuestions | 0.9690 | 0.9493 | 0.9314 | 0.9787 |
| StackExchangeClustering | nan | 0.6603 | 0.5837 | 0.8395 |
| StackExchangeClustering.v2 | 0.9207 | 0.5828 | 0.4643 | 0.9207 |
| StackExchangeClusteringP2P | nan | 0.3509 | 0.329 | 0.5157 |
| StackExchangeClusteringP2P.v2 | 0.5091 | 0.4068 | 0.3854 | 0.5509 |
| StackOverflowDupQuestions | nan | 0.5405 | 0.5014 | 0.6292 |
| StackOverflowQA | 0.9671 | 0.9004 | 0.8889 | 0.9717 |
| SummEval | nan | 0.287 | 0.2964 | 0.4052 |
| SummEvalSummarization.v2 | 0.3828 | 0.2674 | 0.3141 | 0.3893 |
| SyntheticText2SQL | 0.6996 | 0.4633 | 0.5307 | 0.7875 |
| TRECCOVID | 0.8631 | 0.6467 | 0.7115 | 0.9499 |
| Touche2020 | nan | 0.2417 | 0.2313 | 0.3939 |
| Touche2020Retrieval.v3 | 0.5239 | 0.5625 | 0.4959 | 0.7465 |
| ToxicConversationsClassification | 0.8875 | 0.5937 | 0.6601 | 0.9759 |
| TweetSentimentExtractionClassification | 0.6988 | 0.5005 | 0.628 | 0.8823 |
| TwentyNewsgroupsClustering | nan | 0.4603 | 0.394 | 0.8349 |
| TwentyNewsgroupsClustering.v2 | 0.5737 | 0.4477 | 0.3921 | 0.8758 |
| TwitterSemEval2015 | 0.7917 | 0.5815 | 0.7528 | 0.8946 |
| TwitterURLCorpus | 0.8705 | 0.8277 | 0.8583 | 0.9571 |
| Average | 0.7275 | 0.5658 | 0.5592 | 0.7726 |
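For readers checking the aggregate rows: a minimal sketch, using plain Python dicts, of how the "Max result" column and the per-model "Average" row could be computed while skipping missing (nan) entries. The shortened model keys are placeholders, and the table's "Max result" likely ranges over a wider set of reference models than the three columns shown, which is why it can exceed every displayed score.

```python
import math

# Hypothetical scores: task -> {model: score}; nan marks a task the model was not run on.
scores = {
    "Banking77Classification": {"gemini": 0.9427, "granite-r2": 0.8556, "e5-large": 0.7492},
    "MSMARCO": {"gemini": math.nan, "granite-r2": 0.3214, "e5-large": 0.4370},
}

def max_result(task: str) -> float:
    """Best available score for a task, ignoring missing (nan) entries."""
    vals = [v for v in scores[task].values() if not math.isnan(v)]
    return max(vals)

def model_average(model: str) -> float:
    """Mean over only the tasks where the model has a score."""
    vals = [row[model] for row in scores.values() if not math.isnan(row[model])]
    return sum(vals) / len(vals)
```

Averaging only over completed tasks (rather than treating nan as zero) matches how the gemini column can have many missing entries yet still post the highest overall average.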
No immediately concerning results here - everything looks within a reasonable range given the training data annotations.
Adds results for Granite Embedding English R2 models
Checklist
I have added the model to `mteb/models/`, or this can be run as an API. Instructions on how to add a model can be found here